Forecasting based on Bernoulli uncertainty characterization

ABSTRACT

This disclosure relates to predictions based on a Bernoulli uncertainty characterization used in selecting between different prediction models. An example system is configured to perform operations including determining a prediction by a first prediction model. The first prediction model is associated with a loss function. The system is also configured to determine whether the prediction is associated with the first prediction model or a second prediction model based on a joint loss function. The second prediction model is associated with a likelihood function, and the joint loss function is based on the loss function and the likelihood function. The system is further configured to indicate the prediction to the user in response to determining that the prediction is associated with the first prediction model. If the prediction is associated with the second prediction model, the system may prevent indicating the prediction to the user.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application is a continuation-in-part application of, and claims priority to, U.S. patent application Ser. No. 17/115,297 entitled “FORECASTING BASED ON BERNOULLI UNCERTAINTY CHARACTERIZATION” and filed on Dec. 8, 2020, which is assigned to the assignee hereof. The disclosures of all prior applications are considered part of and are incorporated by reference in this patent application.

TECHNICAL FIELD

This disclosure relates generally to systems for data prediction based on a Bernoulli uncertainty characterization used in selecting between different prediction models to generate the prediction.

DESCRIPTION OF RELATED ART

Various computer implemented prediction models are used to forecast various data of interest to a user. For example, various prediction models are used to forecast real estate values, stock market or other asset prices, completion times for projects, and so on. Users may use one or more models to forecast cash flow, revenue, liquidity, and so on of a business from invoices, sales, expenses, and other business records. However, such models are not faultless. For example, on the off chance a computer system implementing a model indicates an inaccurate cash flow prediction to the user, the user may determine a business' future operations based on the inaccurate cash flow prediction.

SUMMARY

This Summary is provided to introduce in a simplified form a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter. Moreover, the systems, methods, and devices of this disclosure each have several innovative aspects, no single one of which is solely responsible for the desirable attributes disclosed herein.

One innovative aspect of the subject matter described in this disclosure can be implemented as a method for dynamically selecting a forecasting model. An example method may include retrieving a number of data points, generating, using a machine learning model, confidence values indicating whether a first forecasting model or a second forecasting model is more likely to generate an accurate prediction for each of the number of data points, selecting the first forecasting model or the second forecasting model for each of the number of data points based on the respective confidence values, generating, for each of the number of data points, prediction data using the selected one of the first forecasting model or the second forecasting model, and training the machine learning model to generate more accurate confidence values for data points based on the prediction data.
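The selection and prediction steps of the example method can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `forecast_with_selection`, the callables passed in, and the 0.5 threshold are all assumptions introduced here, and the training step that refines the confidence values is omitted.

```python
from typing import Callable, List, Sequence, Tuple


def forecast_with_selection(
    data_points: Sequence[float],
    confidence_fn: Callable[[float], float],
    first_model: Callable[[float], float],
    second_model: Callable[[float], float],
    threshold: float = 0.5,  # assumed cut-off; the source does not specify one
) -> List[Tuple[str, float]]:
    """Select a forecasting model per data point and generate prediction data."""
    results = []
    for x in data_points:
        # Confidence near 1 favors the first model, near 0 the second.
        confidence = confidence_fn(x)
        if confidence >= threshold:
            results.append(("first", first_model(x)))
        else:
            results.append(("second", second_model(x)))
    return results
```

Each returned tuple pairs the selected model's label with its prediction, which corresponds to the "selected forecasting model or the prediction data" that may be output for a data point.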

In some aspects, each of the number of data points is associated with a corresponding customer. In some implementations, the method may further include outputting, for each data point, at least one of the selected forecasting model or the prediction data to the corresponding customer. In some other implementations, generating the confidence values is based on a predicted mean associated with the first forecasting model, a predicted standard deviation associated with the first forecasting model, a fixed mean associated with the second forecasting model, and a fixed standard deviation associated with the second forecasting model. In some aspects, a confidence value of 1 indicates that the first forecasting model is more likely to generate an accurate prediction for a respective data point than the second forecasting model, and a confidence value of 0 indicates that the first forecasting model is less likely to generate an accurate prediction for the respective data point than the second forecasting model.
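One way to picture a confidence value built from these four quantities is a ratio of Gaussian densities. The sketch below is an assumption made for illustration only: the disclosure generates confidence values with a machine learning model, and the density-ratio formula and the names `gaussian_pdf` and `confidence_first_model` are stand-ins introduced here.

```python
import math


def gaussian_pdf(y: float, mean: float, std: float) -> float:
    """Density of a Gaussian with the given mean and standard deviation at y."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def confidence_first_model(
    y: float,
    pred_mean: float,   # predicted mean of the first forecasting model
    pred_std: float,    # predicted standard deviation of the first model
    fixed_mean: float,  # fixed mean of the second forecasting model
    fixed_std: float,   # fixed standard deviation of the second model
) -> float:
    """Return a value in [0, 1]; values near 1 favor the first model."""
    p_first = gaussian_pdf(y, pred_mean, pred_std)
    p_second = gaussian_pdf(y, fixed_mean, fixed_std)
    total = p_first + p_second
    if total == 0.0:
        return 0.5  # both densities underflowed; neither model is favored
    return p_first / total
```

Rounding the result at 0.5 would recover the binary values described above, with 1 favoring the first forecasting model and 0 favoring the second.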

In some implementations, training the machine learning model includes generating a joint uncertainty function associated with the first and second forecasting model. In some instances, the method may further include identifying a first uncertainty function associated with the first forecasting model, the first uncertainty function indicating a degree of error of the first forecasting model for a given data point, and identifying a second uncertainty function associated with the second forecasting model, the second uncertainty function indicating a degree of error of the second forecasting model for the given data point, where generating the joint uncertainty function is based on the first uncertainty function and the second uncertainty function. In some other instances, the method may further include updating one or more previous joint uncertainty functions with the generated joint uncertainty function.

In some other implementations, the method may further include generating a total likelihood value jointly associated with the first and second forecasting model based on the joint uncertainty function. In some instances, the method may further include generating a loglikelihood value jointly associated with the first and second forecasting model and a negative loglikelihood value jointly associated with the first and second forecasting model based on the total likelihood value.
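The relationship between the three values can be sketched as below. The sketch assumes, beyond what the source states, that the total likelihood is the product of independent per-data-point likelihoods; the function name `joint_likelihood_values` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def joint_likelihood_values(
    per_point_likelihoods: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (total likelihood, loglikelihood, negative loglikelihood)."""
    if any(p <= 0.0 for p in per_point_likelihoods):
        raise ValueError("likelihoods must be positive")
    # Assumed independence: the total likelihood is the product over data points.
    total = math.prod(per_point_likelihoods)
    # Summing logs is numerically safer than taking the log of a tiny product.
    log_likelihood = sum(math.log(p) for p in per_point_likelihoods)
    return total, log_likelihood, -log_likelihood
```

The negative loglikelihood is the quantity typically minimized during training, which is why it is returned alongside the other two.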

Another innovative aspect of the subject matter described in this disclosure can be implemented in a system for dynamically selecting a forecasting model. An example system may include one or more processors and a memory storing instructions for execution by the one or more processors. Execution of the instructions may cause the system to retrieve a number of data points, generate, using a machine learning model, confidence values indicating whether a first forecasting model or a second forecasting model is more likely to generate an accurate prediction for each of the number of data points, select the first forecasting model or the second forecasting model for each of the number of data points based on the respective confidence values, generate, for each of the number of data points, prediction data using the selected one of the first forecasting model or the second forecasting model, and train the machine learning model to generate more accurate confidence values for data points based on the prediction data.

In some aspects, each of the number of data points is associated with a corresponding customer. In some implementations, execution of the instructions may further cause the system to output, for each data point, at least one of the selected forecasting model or the prediction data to the corresponding customer. In some other implementations, generating the confidence values is based on a predicted mean associated with the first forecasting model, a predicted standard deviation associated with the first forecasting model, a fixed mean associated with the second forecasting model, and a fixed standard deviation associated with the second forecasting model. In some aspects, a confidence value of 1 indicates that the first forecasting model is more likely to generate an accurate prediction for a respective data point than the second forecasting model, and a confidence value of 0 indicates that the first forecasting model is less likely to generate an accurate prediction for the respective data point than the second forecasting model.

In some implementations, training the machine learning model includes generating a joint uncertainty function associated with the first and second forecasting model. In some instances, execution of the instructions may further cause the system to identify a first uncertainty function associated with the first forecasting model, the first uncertainty function indicating a degree of error of the first forecasting model for a given data point, and identify a second uncertainty function associated with the second forecasting model, the second uncertainty function indicating a degree of error of the second forecasting model for the given data point, where generating the joint uncertainty function is based on the first uncertainty function and the second uncertainty function. In some other instances, execution of the instructions may further cause the system to update one or more previous joint uncertainty functions with the generated joint uncertainty function.

In some other implementations, execution of the instructions may further cause the system to generate a total likelihood value jointly associated with the first and second forecasting model based on the joint uncertainty function. In some instances, execution of the instructions may further cause the system to generate a loglikelihood value jointly associated with the first and second forecasting model and a negative loglikelihood value jointly associated with the first and second forecasting model based on the total likelihood value.

BRIEF DESCRIPTION OF THE DRAWINGS

Details of one or more implementations of the subject matter described in this disclosure are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.

FIG. 1 shows a block diagram of a system to indicate a prediction to a user, according to some implementations.

FIG. 2 shows an illustrative flowchart depicting an example operation for indicating a prediction to a user, according to some implementations.

FIG. 3 shows an illustrative flowchart depicting an example operation for training prediction models used in determining a prediction, according to some implementations.

FIG. 4 shows an illustrative flowchart depicting an example operation for dynamically selecting a forecasting model, according to some implementations.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

The following description is directed to certain implementations for determining and indicating a prediction to a user. The prediction may be determined based on a Bernoulli uncertainty characterization, with a Bernoulli variable used in selecting between different prediction models to generate the prediction. However, a person having ordinary skill in the art will readily recognize that the teachings herein can be applied in a multitude of different ways. It may be readily understood that certain aspects of the disclosed systems and methods can be arranged and combined in a wide variety of different configurations, all of which are contemplated herein.

A model may be trained to forecast cash flow or other business metrics. For example, a computer system may use the model to predict cash flow for one or more future points in time, and the system may indicate the predictions to a user. The user then directs future business decisions in light of the predictions. Since the user may direct future business decisions in light of the predictions, there is a need for the predictions used in directing business decisions to be accurate (such as more accurate than a simplistic model, including a guess based on a parametric distribution of possible predictions). Inaccurate predictions may negatively affect future business operations determined in light of the predictions. In addition, inaccurate predictions may cause the user to lose trust in the system or model.

In addition, a user may be interested only in predictions that diverge from what is expected. For example, when cash flow of a business is steady, a user may be interested in a predicted change in cash flow greater than a threshold that may significantly impact future business operations (such as a sudden loss or increase that may affect liquidity). As a result of the system constantly indicating the predictions to the user, the user is compelled to decipher which predictions are important and which are unimportant. Yet the sheer number of predictions and the vast amounts of business data influencing the predictions make it impracticable for a user to determine which predictions are of interest within an acceptable amount of time (much less in real time).

As such, there is a need to prevent inaccurate predictions from being indicated to a user. There is also a need to filter which predictions are indicated to a user so that the user is apprised only of the predictions of interest.

In some implementations, a system can filter predictions to be indicated to a user to improve the accuracy of the predictions and the relevance of the predictions to the user. The system may use multiple prediction models to generate predictions, and the system may determine whether and which predictions are to be indicated to the user based on the model to which each prediction is attributed. For example, the system may use a trained prediction model (such as a machine learning model or other suitable model) to generate a prediction, and the system then determines whether the prediction can just as easily be attributed to a control prediction model (which may be a simple prediction model defined by a parametric distribution or quantile regression of the input data) instead of the trained prediction model. If the prediction is determined to be associated with the trained prediction model instead of the control prediction model (such as indicating that the prediction varies from the probability distribution associated with a simple prediction model), the system indicates the prediction to the user. If the trained model's prediction cannot be attributed to the trained model (indicating that the simple prediction model may be at least as effective in predicting as the trained model for that particular instance), the system prevents the trained model's prediction from being indicated to the user. In this manner, the system causes the predictions indicated to the user to be of more relevance and with a higher confidence or likelihood.

Various aspects of the present disclosure provide a unique computing solution to a unique computing problem that did not previously exist. More specifically, the problem of filtering computer generated predictions did not exist prior to the use of computer implemented models for prediction based on vast numbers of financial or other electronic commerce-related transaction records, and is therefore a problem rooted in and created by technological advances in businesses to accurately differentiate between inaccurate and accurate predictions and important and unimportant predictions.

As the number of transactions and records increases, the ability to identify and indicate predictions of importance (and thus be able to determine a plan of action based on the predictions) requires the computational power of modern processors and machine learning models to accurately identify such predictions, in real-time, so that appropriate action can be taken. Therefore, implementations of the subject matter disclosed herein are not an abstract idea such as organizing human activity or a mental process that can be performed in the human mind, for example, because it is not practical, if even possible, for a human mind to evaluate the transactions of thousands to millions, or more, at the same time to identify each prediction's accuracy and importance.

In the following description, numerous specific details are set forth such as examples of specific components, circuits, and processes to provide a thorough understanding of the present disclosure. The term “coupled” as used herein means connected directly to or connected through one or more intervening components or circuits. The terms “processing system” and “processing device” may be used interchangeably to refer to any system capable of electronically processing information. Also, in the following description and for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the aspects of the disclosure. However, it will be apparent to one skilled in the art that these specific details may not be required to practice the example implementations. In other instances, well-known circuits and devices are shown in block diagram form to avoid obscuring the present disclosure. Some portions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory.

In the figures, a single block may be described as performing a function or functions. However, in actual practice, the function or functions performed by that block may be performed in a single component or across multiple components, and/or may be performed using hardware, using software, or using a combination of hardware and software. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described below generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. Also, the example systems and devices may include components other than those shown, including well-known components such as a processor, memory, and the like.

Several aspects of prediction analysis and indicating predictions to a user for a business will now be presented with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, components, circuits, devices, processes, algorithms, and the like (collectively referred to herein as “elements”). These elements may be implemented using electronic hardware, computer software, or any combination thereof. Whether such elements are implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.

By way of example, an element, or any portion of an element, or any combination of elements may be implemented as a “processing system” that includes one or more processors. Examples of processors include microprocessors, microcontrollers, graphics processing units (GPUs), central processing units (CPUs), application processors, digital signal processors (DSPs), reduced instruction set computing (RISC) processors, systems on a chip (SoC), baseband processors, field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. One or more processors in the processing system may execute software. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software components, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.

Accordingly, in one or more example implementations, the functions described may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can include a random-access memory (RAM), a read-only memory (ROM), an electrically erasable programmable ROM (EEPROM), optical disk storage, magnetic disk storage, other magnetic storage devices, combinations of the aforementioned types of computer-readable media, or any other medium that can be used to store computer executable code in the form of instructions or data structures that can be accessed by a computer.

FIG. 1 shows a block diagram of a system 100 to indicate a prediction to a user, according to some implementations. Although described herein as predictions with respect to cash flow of a business, in some other implementations, the predictions may be with respect to revenue, invoice payments, asset prices, or any other suitable predictions that may or may not be business related. The system 100 is shown to include an input/output (I/O) interface 110, a database 120, one or more processors 130, a memory 135 coupled to the one or more processors 130, a first prediction model 140, a second prediction model 150, a selection model 160, and a data bus 180. The various components of the system 100 may be connected to one another by the data bus 180, as depicted in the example of FIG. 1. In other implementations, the various components of the system 100 may be connected to one another using other suitable signal routing resources.

The interface 110 may include any suitable devices or components to obtain information (such as input data) for the system 100 and/or to provide information (such as output data) from the system 100. In some instances, the interface 110 includes at least a display and an input device (such as a mouse and keyboard) that allows users to interface with the system 100 in a convenient manner. The interface 110 may indicate one or more predictions determined by one or more of the prediction models 140 and 150. Example indications may include a visual indication (such as indicating the prediction to a user via a display).

The input data includes data provided to the prediction models 140 and 150 to generate predictions. The input data may include training data to train the models 140-160 or data used for operation of the trained models to determine predictions to be indicated to a user. For example, if the prediction models predict cash flow of a business, example input data includes payments, invoices, or other known business activity. While the examples herein are described with reference to predicting cash flow, the system 100 may be configured to predict any suitable metric of interest to a user.

The input data is associated with a plurality of features and responses used in predicting future cash flow. Example features include transactions involving vendors, clients, or other entities that may influence the predictions. For example, features may include fees from an invoice collected from a client, fees paid to a vendor, taxes paid, or other measured transactions that may affect cash flow. Responses include changes to the cash flow based on the features. The notation of the feature-response pairs of the input data is (x_i, y_i) for integer i from 1 to N, with x_i and y_i being real numbers. While the examples herein of input data, generating predictions, and indicating predictions to a user are provided in a univariate setting for clarity in explaining aspects of the present disclosure, the operations described herein may also be performed in a multivariate setting.

The database 120 can store any suitable information relating to the input data or the predictions. For example, the database 120 can store training data or operational data received via the interface 110, previous predictions, variable information or other information about the models 140-160, or other suitable information. In some instances, the database 120 can be a relational database capable of manipulating any number of various data sets using relational operators, and present one or more data sets and/or manipulations of the data sets to a user in tabular form. The database 120 can also use Structured Query Language (SQL) for querying and maintaining the database, and/or can store information relevant to the predictions in tabular form, either collectively in a table or individually for each prediction.

The one or more processors 130, which may be used for general data processing operations (such as transforming data stored in the database 120 into usable information), may be one or more suitable processors capable of executing scripts or instructions of one or more software programs stored in the system 100 (such as within the memory 135). The one or more processors 130 may be implemented with a general purpose single-chip or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. In one or more implementations, the one or more processors 130 may be implemented as a combination of computing devices (such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).

The memory 135 may be any suitable persistent memory (such as one or more nonvolatile memory elements, such as EPROM, EEPROM, Flash memory, a hard drive, etc.) that can store any number of software programs, executable instructions, machine code, algorithms, and the like that, when executed by the one or more processors 130, cause the system 100 to perform at least some of the operations described below with reference to one or more of the Figures. In some instances, the memory 135 can also store training data and/or seed data for the components 140-160.

The first prediction model 140 can be used to generate one or more predictions from the data obtained by the system 100. For example, the first prediction model 140 predicts one or more future data points in cash flow of a business. In some implementations, the first prediction model 140 is a machine learning model based on one or more of decision trees, random forests, logistic regression, nearest neighbors, classification trees, control flow graphs, support vector machines, naïve Bayes, Bayesian Networks, value sets, hidden Markov models, or neural networks configured to predict one or more data points from the input data. However, the first prediction model 140 may be any suitable prediction model (including user defined or supervised models). The first prediction model 140 is the primary prediction model of the system 100. In this manner, the user is interested in the predictions from the first prediction model 140, and the system 100 may indicate the predictions from the first prediction model 140 to the user.

The second prediction model 150 is a prediction model to generate a second set of predictions. For example, the second prediction model 150 may be used in evaluating the predictions of the first prediction model 140. In some implementations, the second prediction model 150 is a predefined prediction model, such as a statistical model defined by a probability distribution. For example, the second prediction model 150 is a regression model based on a parametric distribution of noise in the input data (such as a Gaussian distribution, Poisson distribution, or other known distributions). For a Gaussian distribution including a mean and standard deviation, the mean and standard deviation define the second prediction model 150. However, any suitable distribution or model may be used. In another example, the probability distribution of the second prediction model 150 is based on quantiles (such as quantiles at 10 percent increments of confidence or any other suitable confidence intervals). The second prediction model 150 attempts to generate predictions from the same dataset used by the first prediction model 140 to generate predictions. In this manner, the predictions of the models 140 and 150 may be compared to each other. In one example, the second prediction model 150 may be considered a control model whose predictions are to be used in analyzing the predictions from the first prediction model 140. For example, if a prediction from the first prediction model 140 can be just as easily attributed to the second prediction model 150 as to the first prediction model 140 (such as the prediction not varying by more than a tolerance from what the second prediction model 150 would predict), the system 100 may be configured to prevent indicating the prediction to the user. Such comparison and determination may be performed using the selection model 160. While the system 100 is depicted as including two prediction models, the system 100 may include any suitable number of prediction models (such as three or more prediction models). In this manner, predictions from one or more primary prediction models may be analyzed based on one or more other prediction models to determine if a prediction is to be indicated to the user.
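The tolerance comparison against a Gaussian control model can be illustrated with a short sketch. The function name `should_indicate` and the choice of expressing the tolerance as a multiple `k` of the control model's standard deviation are assumptions made here for illustration; the source does not fix how the tolerance is defined.

```python
def should_indicate(
    prediction: float,
    control_mean: float,
    control_std: float,
    k: float = 2.0,  # assumed tolerance, in control-model standard deviations
) -> bool:
    """Return True if the prediction varies from the control model by more than the tolerance."""
    if control_std <= 0:
        raise ValueError("control_std must be positive")
    return abs(prediction - control_mean) > k * control_std
```

For a control model with mean 100 and standard deviation 10, a prediction of 130 would be indicated under this rule, while a prediction of 105 would be suppressed because the control model explains it just as well.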

The selection model 160 can be used to determine whether the prediction is to be indicated to the user. For example, the selection model 160 determines whether the prediction from the first prediction model 140 is more likely associated with the first prediction model 140 or with the second prediction model 150. Example implementations of the selection model 160 being used to determine whether to indicate the prediction to the user are described in the examples herein.

Each of the first prediction model 140, the second prediction model 150, and the selection model 160 may be incorporated in software (such as software stored in memory 135) and executed by one or more processors (such as the one or more processors 130), may be incorporated in hardware (such as one or more application specific integrated circuits (ASICs)), or may be incorporated in a combination of hardware and software. For example, one or more of the models 140-160 may be coded using Python for execution by the one or more processors. In addition or in the alternative, one or more of the components 140-160 may be combined into a single component or may be split into additional components not shown. The particular architecture of the system 100 shown in FIG. 1 is but one example of a variety of different architectures within which aspects of the present disclosure may be implemented.

The system 100 (using the selection model 160) is configured to determine whether predictions from the first prediction model 140 are to be indicated to the user or prevented from being indicated to the user. Indicating the predictions to the user is based on a variance of the predictions from what the second prediction model would predict. In this manner, a prediction significantly varying from a second prediction model's output may be of interest to the user and is thus indicated to the user. A prediction not varying from the second prediction model's output (such as not varying from a Gaussian probability distribution or other parametric distribution attributed to noise) may not be of interest to the user and is thus prevented from being indicated to the user. Implementations of determining with which prediction model a prediction is associated are based on a Bernoulli variable (also referred to as a binary variable). As used herein, a Bernoulli variable is a variable with two discrete values (such as 0 or 1). The Bernoulli variable may be used in a joint loss function associated with both prediction models to evaluate the predictions. In the examples, the first prediction model is associated with the Bernoulli variable value equal to 1, and the second prediction model is associated with the Bernoulli variable value equal to 0. While the examples are provided for two prediction models, as noted above, the system 100 may include three or more prediction models. In this manner, the number of discrete values for the Bernoulli variable may be expanded from two to a multi-valued discrete distribution. In a different example, multiple Bernoulli variables, each taking one of two discrete values, may be combined to allow for three or more prediction models to be used. As such, the below examples of two prediction models are provided for clarity in explaining aspects of the present disclosure, but the scope of the present disclosure is not limited to only two prediction models.
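The extension from a two-valued Bernoulli variable to a multi-valued discrete variable can be pictured as a one-hot vector over the available prediction models. The sketch below is an illustrative assumption: the function name `one_hot_selection` and the use of per-model scores (for example, likelihoods) as input are not taken from the source.

```python
from typing import List, Sequence


def one_hot_selection(model_scores: Sequence[float]) -> List[int]:
    """Return a one-hot vector marking the prediction model with the highest score.

    Position 0 corresponds to the first prediction model. With exactly two
    models, the first entry behaves like the Bernoulli variable described
    above: 1 for the first model, 0 for the second.
    """
    if not model_scores:
        raise ValueError("at least one model score is required")
    best = max(range(len(model_scores)), key=lambda k: model_scores[k])
    return [1 if k == best else 0 for k in range(len(model_scores))]
```

With three scores such as `[0.2, 0.7, 0.1]` the result is `[0, 1, 0]`, selecting the second of three models; with two scores such as `[0.9, 0.1]` the result is `[1, 0]`, matching a Bernoulli value of 1.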

Use of a joint loss function associated with the multiple prediction models allows for determining with greater accuracy if a specific prediction from the first prediction model 140 is associated with the first prediction model 140 over the second prediction model 150 (and thus is to be indicated to a user). In typical prediction systems, traditional Bayesian methods of determining a confidence based on a loss function include adding a separate variable to the prediction model's loss function for a model uncertainty (such as noise). In this manner, the loss function includes a combination of a model uncertainty and an observation uncertainty, and as a result of the multiple uncertainties, typical methods of analyzing the loss function to determine a confidence for a specific data point (based on the observation uncertainty) become impossible.

In some implementations, the model uncertainty may also be modeled in a second prediction model. For example, if the data includes a Gaussian distribution of noise, the second prediction model may be based on a Gaussian distribution associated with a known likelihood function. In this manner, the loss function determined for the first prediction model and the likelihood function known for the second prediction model may be combined to generate a joint loss function associated with both prediction models. The loss function and the likelihood function both include the model uncertainty that may be used to isolate the observation uncertainty for determining a confidence in a prediction from the loss function. The determined confidence indicates an estimated likelihood of the prediction occurring.

In the following examples, the association of a prediction with a specific prediction model and the determination of a confidence is formulated in terms of a regression problem for time series data (such as predicting cash flow from input time series data for a business). For example, prediction of values may be characterized as a problem including auto-regressive delayed values in time series. Each prediction from the first prediction model 140 may not be assumed to be associated with a parametric probability distribution, but the totality of the predictions from the first prediction model 140 may be associated with a similar distribution as used to generate the second prediction model 150. The probability distribution of a second prediction model 150 may be a parametric probability distribution (such as Gaussian, Log-Normal, Poisson, and so on) or not a parametric probability distribution (such as based on Quantile regression). In the specific examples provided below for clarity, the second prediction model 150 is defined by a Gaussian probability distribution, which is used to explain the associated likelihood function and joint loss functions with specificity in the examples. In this manner, the input data (x_(i), y_(i)) is associated with a regression model parameterized by w. However, any suitable prediction model may be used as a control prediction model.

The predicted probability distribution p(ŷ_(i)|w, x_(i), z_(i)=1) for the first prediction model (with z_(i) being the Bernoulli variable with values 0 (for the second prediction model) and 1 (for the first prediction model)) can be represented by a parametric distribution associated with the second prediction model (in this instance, a Gaussian distribution). Under the assumption of a Gaussian distribution, the probability distribution can be represented by a mean (noted as a vector of mean values over the training data; μ_(i)(w)∈ℝ^(n)) and a standard deviation (noted as a vector of standard deviation values over the training data; σ_(i)(w)∈ℝ^(n)). In this manner, each prediction may be associated with a different mean and standard deviation. The predicted probability distribution p(ŷ_(i)|w, x_(i), z_(i)=0) for the second prediction model based on a Gaussian distribution is defined as a mean (μ̄∈ℝ) and standard deviation (σ̄∈ℝ) of the training data, which may be determined from the group of feature-response pairs of the input data. In the example, the second prediction model is a low variance naïve prediction model comprised of the mean and standard deviation of the training data. However, any suitable prediction model may be used as the second prediction model. z_(i)(w, x_(i))∈{0, 1} is the Bernoulli variable which is used to determine if a prediction is associated with the first prediction model 140 or the second prediction model 150. As noted above, if more than two prediction models are used, the Bernoulli variable may be a distribution of more than two discrete values based on the number of prediction models. The probability of the first prediction model being selected (p(z_(i)=1|w, x_(i))) is also noted as θ_(i)(w, x_(i)). If z_(i) is binary, the probability of the second prediction model being selected (p(z_(i)=0|w, x_(i))) is defined as 1−θ_(i)(w, x_(i)) since the sum of the two probabilities equals 1. In this manner, a portion of each prediction from the first prediction model is associated with some representation of p(z_(i)=0|w, x_(i)). The larger the portion attributed to such representation, the less likely the prediction is associated with the first prediction model 140 (as the second prediction model 150 may be just as effective in providing such prediction). Details of the joint loss function, determining a prediction's association based on the joint loss function, and use of a Bernoulli variable in the joint loss function for determining whether to indicate a prediction to a user are described below with reference to FIGS. 2 and 3.

FIG. 2 shows an illustrative flowchart depicting an example operation 200 for indicating a prediction to a user, according to some implementations. The example operation 200 is described as being performed by the system 100 (such as by the one or more processors 130 executing instructions to perform operations associated with the components 140-160). At 202, the system 100 determines a prediction by a first prediction model 140. The first prediction model 140 is associated with a loss function. At 204, the system 100 determines whether the prediction is associated with the first prediction model 140 or the second prediction model 150 based on a joint loss function. The second prediction model is associated with a likelihood function, and the joint loss function is based on the loss function and the likelihood function.

In the above example of the predicted probability distributions for the first prediction model 140 and the second prediction model 150 in light of a Bernoulli variable z_(i) (with z_(i) equal to 1 for the first prediction model 140 and equal to 0 for the second prediction model 150) and based on a Gaussian distribution, the loss function of the first prediction model 140 is the probability density function for a Gaussian distribution, as indicated in equation (1) below:

$\begin{matrix}{{p\left( {\left. {\hat{y}}_{i} \middle| w \right.,x_{i},{z_{i} = 1}} \right)} = {\frac{1}{\sigma_{i}\sqrt{2\pi}}e^{{- \frac{1}{2}}{(\frac{y_{i} - \mu_{i}}{\sigma_{i}})}^{2}}}} & (1)\end{matrix}$

The likelihood function of the second prediction model 150 is also a probability density function for a Gaussian distribution, as indicated in equation (2) below:

$\begin{matrix}{{p\left( {\left. {\hat{y}}_{i} \middle| w \right.,x_{i},{z_{i} = 0}} \right)} = {\frac{1}{\overset{\_}{\sigma}\sqrt{2\pi}}e^{{- \frac{1}{2}}{(\frac{y_{i} - \overset{\_}{\mu}}{\overset{\_}{\sigma}})}^{2}}}} & (2)\end{matrix}$

As shown in equation (1), the loss function associated with the first prediction model 140 includes first variables μ_(i) and σ_(i) that are used to generate a probability (which may be referred to as a confidence) in a prediction ŷ_(i) from the first prediction model 140. As shown in equation (2), the likelihood function associated with the second prediction model 150 includes second variables μ̄ and σ̄ that are used to generate a probability that the prediction ŷ_(i) would be provided by the second prediction model 150 (such as based on where the prediction lies in the Gaussian distribution defined by the mean and standard deviation). The first variables and the second variables correspond to each other. In other words, the variables between the models are similar types of variables. In the example, both sets of variables include a mean and a standard deviation. Other types of loss functions and likelihood functions may include different variables used to characterize the functions (such as a variance, a median, values for Quantile regression, or other measurements). With similar types of variables, the loss function and the likelihood function can be combined into a joint loss function that is optimized during training. In this manner, the first prediction model 140 and the second prediction model 150 are associated with a joint loss function, and the models may be trained concurrently in optimizing the joint loss function.
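The two densities in equations (1) and (2) share one functional form and differ only in which mean and standard deviation are supplied. The following is a minimal sketch, not the disclosed implementation; the function name `gaussian_pdf` and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(y, mean, std):
    """Gaussian probability density, the form shared by equations (1) and (2)."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical values for a single data point (assumptions, not from the disclosure).
y_i = 120.0
mu_i, sigma_i = 118.0, 4.0        # per-point mean and std from the first model
mu_bar, sigma_bar = 100.0, 15.0   # fixed mean and std of the training data

p_first = gaussian_pdf(y_i, mu_i, sigma_i)       # equation (1)
p_second = gaussian_pdf(y_i, mu_bar, sigma_bar)  # equation (2)
```

With these illustrative numbers the first density is the larger of the two, since the value lies close to the per-point mean and well away from the training mean.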

A joint loss function (which may also be referred to as a joint likelihood function l_(i)) based on the Bernoulli variable z_(i) is associated with a mutual exclusivity between the prediction being associated with the first prediction model 140 (z_(i)=1) and the prediction being associated with the second prediction model 150 (z_(i)=0). The joint loss function created using the Bernoulli variable z_(i) is indicated in a general form in equation (3) below:

l _(i) =p(ŷ _(i) ,z _(i) |w,x _(i))=p(ŷ _(i) ,z _(i)=1|w,x _(i))+p(ŷ_(i) ,z _(i)=0|w,x _(i))  (3)

The joint loss function indicates the combined probabilities of the prediction if the first prediction model 140 is selected and if the second prediction model 150 is selected as being associated with the prediction. p(ŷ_(i), z_(i)=a|w, x_(i)) for a∈{0, 1} can be expanded into a multiplication or dot product of the probability of the Bernoulli variable being a for w and x_(i) and the probability of the prediction being ŷ_(i) for w, x_(i), and z_(i)=a, as indicated in equation (4) below:

p(ŷ _(i) ,z _(i) =a|w,x _(i))=p(z _(i) =a|w,x _(i))·p(ŷ _(i) |w,x _(i),z _(i) =a)  (4)

Using equation (4), l_(i) in equation (3) can be expanded into the form indicated in equation (5) below:

l _(i) =p(z _(i)=1|w,x _(i))·p(ŷ _(i) |w,x _(i) ,z _(i)=1)+p(z _(i)=0|w,x _(i))·p(ŷ _(i) |w,x _(i) ,z _(i)=0)  (5)

Since equation (5) of the joint loss function is in a general form, the equation may be used for any noise model to determine a joint loss function for two prediction models. If three or more prediction models are to be used, equation (4) may be used to expand equation (5) for the desired number of prediction models. Referring back to equation (5) for two prediction models 140 and 150, the joint loss function is optimized to train the first prediction model 140 and the second prediction model 150 (which is described below with reference to FIG. 3). The probability l_(i) is for a given x_(i) for integer i. The total probability/likelihood L across all i from 1 to N in the input data is defined as the product of all l_(i) for i from 1 to N, as indicated in equation (6) below:

L:=Π _(i=1) ^(N) l _(i)  (6)

Equation (6) of the total likelihood function is also in a general form, and the equation may be used for any specific joint loss function to determine a total likelihood function.
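Equations (5) and (6) can be sketched for the Gaussian example as a two-component mixture weighted by θ_(i) and 1−θ_(i), followed by a product over data points. This is an illustrative sketch only; the helper names `gaussian_pdf`, `joint_likelihood`, and `total_likelihood` are assumptions introduced here and do not appear in the disclosure.

```python
import math


def gaussian_pdf(y, mean, std):
    """Gaussian density used for both prediction models in the example."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def joint_likelihood(y, theta, mu_i, sigma_i, mu_bar, sigma_bar):
    """Per-point joint likelihood l_i of equation (5).

    theta is p(z_i = 1 | w, x_i); 1 - theta is p(z_i = 0 | w, x_i).
    """
    first_term = theta * gaussian_pdf(y, mu_i, sigma_i)
    second_term = (1.0 - theta) * gaussian_pdf(y, mu_bar, sigma_bar)
    return first_term + second_term


def total_likelihood(l_values):
    """Total likelihood L of equation (6): the product of all l_i."""
    return math.prod(l_values)
```

When θ_(i) is 1 the joint likelihood reduces to the first model's density, and when θ_(i) is 0 it reduces to the second model's density, which reflects the mutual exclusivity described above.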

Referring back to 204 in FIG. 2, determining whether the prediction is associated with the first prediction model 140 or the second prediction model 150 may include determining the probability p(z_(i)=1|w, x_(i)) (also referred to as θ_(i)). In some implementations, the system 100 (using the selection model 160) determines that the prediction is associated with the first prediction model 140 if θ_(i) is greater than a threshold, and the system 100 determines that the prediction is associated with the second prediction model 150 if θ_(i) is less than the threshold. In some implementations, different thresholds are associated with the first prediction model 140 and the second prediction model 150. In this manner, θ_(i) between a lower threshold associated with the second prediction model 150 and an upper threshold associated with the first prediction model 140 may indicate that the system 100 is fuzzy in selecting either prediction model. In other words, as θ_(i) approaches ½, the prediction may be as easily associated with the second prediction model 150 as with the first prediction model 140. In this manner, determining with which prediction model the prediction is associated is based on the joint loss function.

At 206, in response to determining that the prediction is associated with the first prediction model 140 (such as θ_(i) being greater than a threshold), the system 100 indicates the prediction to a user (such as via the interface 110). In some implementations, if the system 100 determines that the prediction is associated with the second prediction model 150, the system 100 prevents indicating the prediction to the user. In this manner, the system 100 filters which predictions from the first prediction model 140 are presented to the user based on whether the prediction is attributed to the first prediction model of interest to the user. In addition or in the alternative, the system 100 may indicate that a prediction is filtered or provide any other suitable indication that the prediction is not associated with the first prediction model 140.
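The threshold comparison at 204 and the filtering at 206 can be sketched as follows. The threshold values 0.4 and 0.6, the label strings, and the function names are hypothetical choices made for this illustration; the disclosure does not specify them.

```python
def associate(theta, lower=0.4, upper=0.6):
    """Associate a prediction with a model based on theta = p(z_i = 1 | w, x_i).

    lower and upper are illustrative thresholds (assumptions). A theta between
    them is treated as the fuzzy region where neither model is selected.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be a probability in [0, 1]")
    if theta > upper:
        return "first"
    if theta < lower:
        return "second"
    return "undecided"


def filter_predictions(predictions, lower=0.4, upper=0.6):
    """Keep only the predicted values associated with the first model.

    predictions is a list of (value, theta) pairs.
    """
    return [
        value
        for value, theta in predictions
        if associate(theta, lower, upper) == "first"
    ]


# Hypothetical cash flow predictions paired with their theta values.
predictions = [(1250.0, 0.93), (980.0, 0.12), (1100.0, 0.51)]
shown = filter_predictions(predictions)
```

Setting `lower` equal to `upper` gives the single-threshold variant described above, in which only a θ_(i) exactly at the threshold is left undecided.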

While not shown, determining whether the prediction is indicated to a user may also be based on the confidence in the prediction. For example, if a total likelihood L in the prediction is less than a threshold, the prediction is not indicated to the user. In some other examples, the indication of the prediction may be accompanied with an indication of the confidence or another suitable indication. As a result, predictions with a low confidence are not presented to the user or are accompanied by an explanation so that the user understands the low confidence.

Before the prediction models 140 and 150 are used by the system 100 to predict future cash flow (or any other suitable metrics) and the selection model 160 is used in determining whether the predicted cash flow is to be indicated to the user, the prediction models 140 and 150 are trained using a training set of data (such as historic transaction data and measured cash flow). In typical training of a prediction model, the variables of the loss function are tuned over epochs of the training data to minimize the overall loss for predictions. As used herein, minimizing a loss function refers to reducing the output of the loss function over epochs of the training data. If the output is not reduced by more than a threshold over a consecutive number of epochs, the loss function may be determined to be minimized using the latest variables determined for the loss function. In one example, the Adam optimization algorithm may be used to optimize a loss function.
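The stopping rule described above (the loss output not being reduced by more than a threshold over a consecutive number of epochs) can be sketched as a check over a recorded loss history. The function name `is_minimized` and the parameter names `min_delta` and `patience`, along with their default values, are assumptions for illustration.

```python
def is_minimized(loss_history, min_delta=1e-4, patience=3):
    """Return True if the loss failed to drop by more than min_delta
    in each of the last `patience` consecutive epochs.

    loss_history holds one loss value per completed epoch, oldest first.
    """
    if len(loss_history) <= patience:
        return False  # not enough epochs to judge
    recent = loss_history[-(patience + 1):]
    # Compare each epoch with the one before it.
    return all(prev - cur <= min_delta for prev, cur in zip(recent, recent[1:]))
```

A training loop would call such a check after each epoch and keep the latest variables once it returns True.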

If the models 140 and 150 were trained independently of each other, the one or more first variables would not be determined with reference to the one or more second variables (and vice versa). In addition, training of a Bernoulli variable in optimizing a joint loss function would not occur. As a result, the predictions from one prediction model may not correlate to predictions from the other model. In some implementations, the first prediction model 140 and the second prediction model 150 are trained concurrently by optimizing a joint loss function. As noted above, the joint loss function includes the one or more first variables from the loss function associated with the first prediction model 140 and the one or more second variables from the likelihood function associated with the second prediction model 150. In optimizing the joint loss function, the one or more first variables and the one or more second variables are determined with reference to each other to optimize the overall output from the joint loss function. In addition, the Bernoulli variables across the training dataset points are determined to optimize the overall output from the joint loss function. In this manner, predictions from the models that are trained concurrently correlate to each other.

With the joint loss function being based on a Bernoulli variable (such as z_(i) in equation (5) above to determine l_(i), which is used to determine total likelihood L in equation (6) above), optimizing the joint loss function includes determining the one or more first variables and the one or more second variables to: (i) increase the output of the total likelihood function (with the total likelihood indicating a confidence in the prediction) and (ii) adjust p(z_(i)=1|w, x_(i)) in the total likelihood function towards 0 or 1 (and away from ½). In this manner, the Bernoulli variable may be trained in optimizing the joint loss function. In some implementations, increasing the output of the total likelihood function may include minimizing the negative log likelihood function for the total likelihood (as described below).

FIG. 3 shows an illustrative flowchart depicting an example operation 300 for training prediction models used in determining a prediction, according to some implementations. The prediction models to be trained in describing the example operation 300 include the first prediction model 140 and the second prediction model 150 of the system 100 in FIG. 1. The training may be performed by the system 100 or may be performed by another suitable system or device (with the trained models being provided to the system 100 via the interface 110). The operation 300 is described as being performed by the system 100 in the below examples exclusively for clarity in describing the operation.

At 302, the system 100 obtains a loss function associated with the first prediction model 140 (with the loss function including one or more first variables). At 304, the system 100 obtains a likelihood function associated with the second prediction model 150 (with the likelihood function including one or more second variables). At 306, the system 100 determines a joint loss function based on the loss function and the likelihood function. In some implementations, the joint loss function is determined using equations (5) and (6) above and is provided to the system 100 for training the prediction models 140 and 150. In some other implementations, the system 100 generates the joint loss function based on equations (5) and (6) above. As noted above in equation (5), determining the joint loss function may include combining the loss function and the likelihood function into a single function based on a Bernoulli variable (308). The single function indicates a variance of the first data point from a probability distribution associated with the second prediction model 150. With the Bernoulli variable, θ_(i) approaching 1 indicates that the variance is increasing, and θ_(i) approaching 0 indicates that the variance is decreasing. In this manner, the joint loss function is associated with a mutual exclusivity between the first data point as the prediction and the second data point as the prediction, and outputs of the joint loss function (such as corresponding to a total likelihood) may be used in selecting either the first prediction model 140 or the second prediction model 150 as being associated with the prediction (not both).

At 310, the system 100 optimizes the joint loss function to concurrently train the first prediction model 140 and the second prediction model 150. Optimizing the joint loss function may also include training the Bernoulli variable as to when the variable is 0 and when the variable is 1 (or other values if more than two prediction models are used) for the training set of data. In some implementations, optimizing the joint loss function includes applying a training set of data to the first prediction model 140 and to the second prediction model 150 to generate values for the one or more first variables and the one or more second variables of the joint loss function (312). In the above example of the first variables and the second variables including means and standard deviations, the system 100 determines the means and standard deviations to optimize the joint loss function so that the total likelihood increases. For example, the system 100 determines the means and standard deviations to minimize a negative log likelihood function based on the total likelihood.

A specific example of determining the joint loss function and optimizing the joint loss function is provided below with reference to the noise being modeled as a Gaussian distribution and the second prediction model 150 being defined as a Gaussian distribution (as described with reference to equations (1) and (2) above). The specific example is provided for clarity in explaining aspects of the joint loss function (and total likelihood function). It is apparent from the below example that the steps may be performed for any joint loss function determined for any suitable first prediction model 140 and second prediction model 150.

In some implementations, optimizing the joint loss function includes minimizing the negative log likelihood function for the total likelihood L. The negative log likelihood (−log(L)) based on equation (6) above is indicated in a general form in equation (7) below:

−log(L)=−log(Π_(i=1) ^(N) l _(i))=−Σ_(i=1) ^(N) log l _(i)  (7)
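One practical reason to work with the sum in equation (7) rather than the product in equation (6) is floating point underflow: a product of many small per-point likelihoods can round to zero, while the sum of their logarithms remains finite. The sketch below illustrates this with hypothetical, identical per-point likelihood values; the function name is an assumption.

```python
import math


def negative_log_likelihood_from_points(l_values):
    """Equation (7): -log(L) computed as the negated sum of log l_i."""
    return -sum(math.log(l) for l in l_values)


# Hypothetical per-point likelihoods: 2000 points, each with l_i = 0.001.
l_values = [1e-3] * 2000

product = math.prod(l_values)                        # equation (6) directly
nll = negative_log_likelihood_from_points(l_values)  # equation (7)
```

Here `product` underflows to 0.0 in double precision, so taking its logarithm would fail, whereas `nll` is an ordinary finite number.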

In the example, an output of the predicted probability distribution p(ŷ_(i)|w, x_(i), z_(i)=1) includes variables [μ_(i)(x_(i), w), σ_(i)(x_(i), w), θ_(i)(w, x_(i))] (with θ_(i) being a notation of p(z_(i)=1|w, x_(i)) indicating the probability that the first prediction model 140 is selected), and the predicted probability distribution p(ŷ_(i)|w, x_(i), z_(i)=1) is assumed to follow a Gaussian distribution N(μ_(i)(x_(i), w), σ_(i)(x_(i), w)). With the above assumptions, the joint loss function for l_(i) (indicated in a general form in equation (5) above) is defined for the specific example in equation (8) below:

l _(i) =p(z _(i)=1|w,x _(i))·p(ŷ _(i) |w,x _(i) ,z _(i)=1)+p(z_(i)=0|w,x _(i))·p(ŷ _(i) |w,x _(i) ,z _(i)=0)  (8)

Replacing p(z_(i)=1|w, x_(i)) and p(z_(i)=0|w, x_(i)) with the θ_(i) and 1−θ_(i) notation, respectively (since the sum of the probabilities equals 1), yields equation (9) below:

l _(i)=θ_(i) ·p(ŷ _(i) |w,x _(i) ,z _(i)=1)+(1−θ_(i))·p(ŷ _(i) |w,x _(i),z _(i)=0)  (9)

For the example, substituting p(ŷ_(i)|w, x_(i), z_(i)=1) and p(ŷ_(i)|w, x_(i), z_(i)=0) with the terms from equations (1) and (2) above, respectively, yields equation (10) below:

$\begin{matrix}{l_{i} = {{\theta_{i} \cdot \frac{1}{\sigma_{i}\sqrt{2\pi}} \cdot e^{{- \frac{1}{2}}{(\frac{y_{i} - \mu_{i}}{\sigma_{i}})}^{2}}} + {{\left( {1 - \theta_{i}} \right) \cdot \frac{1}{\overset{\_}{\sigma}\sqrt{2\pi}}}e^{{- \frac{1}{2}}{(\frac{y_{i} - \overset{\_}{\mu}}{\overset{\_}{\sigma}})}^{2}}}}} & (10)\end{matrix}$

Equation (10) can be rewritten as equation (11) below:

$\begin{matrix}{l_{i} = {\frac{\theta_{i}}{\sigma_{i}\sqrt{2\pi}} \cdot {e^{{- \frac{1}{2}}{(\frac{y_{i} - \mu_{i}}{\sigma_{i}})}^{2}}\left( {1 + {\frac{\left( {1 - \theta_{i}} \right)\sigma_{i}}{\theta_{i}\overset{\_}{\sigma}} \cdot e^{{\frac{1}{2\sigma_{i}^{2}}{({y_{i} - \mu_{i}})}^{2}} - {\frac{1}{2{\overset{\_}{\sigma}}^{2}}{({y_{i} - \overset{\_}{\mu}})}^{2}}}}} \right)}}} & (11)\end{matrix}$
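The step from equation (10) to equation (11) is a factoring of the first term out of the sum. A short worked derivation is sketched below; the shorthand A_i and B_i is introduced here only for illustration, and the factoring assumes θ_i is nonzero.

```latex
% Shorthand (introduced for this sketch only) for the two terms of equation (10):
A_i = \frac{\theta_i}{\sigma_i\sqrt{2\pi}}\,
      e^{-\frac{1}{2}\left(\frac{y_i-\mu_i}{\sigma_i}\right)^2},
\qquad
B_i = \frac{1-\theta_i}{\bar{\sigma}\sqrt{2\pi}}\,
      e^{-\frac{1}{2}\left(\frac{y_i-\bar{\mu}}{\bar{\sigma}}\right)^2}

% Factor out A_i (valid for \theta_i \neq 0):
l_i = A_i + B_i = A_i\left(1 + \frac{B_i}{A_i}\right)

% The ratio collects the weights, the standard deviations, and the exponents:
\frac{B_i}{A_i}
  = \frac{(1-\theta_i)\,\sigma_i}{\theta_i\,\bar{\sigma}}\cdot
    e^{\frac{1}{2\sigma_i^2}(y_i-\mu_i)^2-\frac{1}{2\bar{\sigma}^2}(y_i-\bar{\mu})^2}
```

Substituting the ratio back into l_i = A_i(1 + B_i/A_i) gives the form of equation (11).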

For the joint loss function in equation (11), the total likelihood L isdefined as in equation (12) below:

$\begin{matrix}{L:={\prod_{i = 1}^{N}\left( {\frac{\theta_{i}}{\sigma_{i}\sqrt{2\pi}} \cdot {e^{{- \frac{1}{2}}{(\frac{y_{i} - \mu_{i}}{\sigma_{i}})}^{2}}\left( {1 + {\frac{\left( {1 - \theta_{i}} \right)\sigma_{i}}{\theta_{i}\overset{\_}{\sigma}} \cdot e^{{\frac{1}{2\sigma_{i}^{2}}{({y_{i} - \mu_{i}})}^{2}} - {\frac{1}{2{\overset{\_}{\sigma}}^{2}}{({y_{i} - \overset{\_}{\mu}})}^{2}}}}} \right)}} \right)}} & (12)\end{matrix}$

As noted above, optimizing the joint loss function may include increasing the total likelihood L, such as minimizing the negative log likelihood function −log(L). The log likelihood function based on L in equation (12) is provided in equation (13) below:

$\begin{matrix}{{\log(L)} = {{- {\sum_{i = 1}^{N}\left( {\frac{\left( {y_{i} - \mu_{i}} \right)^{2}}{2\sigma_{i}^{2}} + {\log\left( \frac{\sigma_{i}}{\theta_{i}} \right)} - {\log\left( {1 + {\frac{1 - \theta_{i}}{\theta_{i}} \cdot \frac{\sigma_{i}}{\overset{\_}{\sigma}} \cdot e^{{\frac{1}{2\sigma_{i}^{2}}{({y_{i} - \mu_{i}})}^{2}} - {\frac{1}{2{\overset{\_}{\sigma}}^{2}}{({y_{i} - \overset{\_}{\mu}})}^{2}}}}} \right)}} \right)}} - {{N\log}\sqrt{2\pi}}}} & (13)\end{matrix}$

As shown in equation (13), the two overall terms of the log likelihood function are written to be expressed as negative terms (with both including a minus sign). In this manner, the negative log likelihood function to be minimized in training the first prediction model 140 and the second prediction model 150 for the specific example is provided in equation (14) below:

$\begin{matrix}{{- {\log(L)}} = {{\sum_{i = 1}^{N}\left( {\frac{\left( {y_{i} - \mu_{i}} \right)^{2}}{2\sigma_{i}^{2}} + {\log\left( \frac{\sigma_{i}}{\theta_{i}} \right)} - {\log\left( {1 + {\frac{1 - \theta_{i}}{\theta_{i}} \cdot \frac{\sigma_{i}}{\overset{\_}{\sigma}} \cdot e^{{\frac{1}{2\sigma_{i}^{2}}{({y_{i} - \mu_{i}})}^{2}} - {\frac{1}{2{\overset{\_}{\sigma}}^{2}}{({y_{i} - \overset{\_}{\mu}})}^{2}}}}} \right)}} \right)} + {{N\log}\sqrt{2\pi}}}} & (14)\end{matrix}$
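As a sanity check on equation (14), the sketch below evaluates the negative log likelihood in the factored form and compares it with the value obtained directly from the mixture in equation (10). The function names and all data values are hypothetical; θ_i is assumed to lie strictly between 0 and 1, and the exponential may overflow for extreme inputs, which this sketch does not guard against.

```python
import math


def negative_log_likelihood(y, mu, sigma, theta, mu_bar, sigma_bar):
    """Negative log likelihood in the form of equation (14)."""
    total = 0.0
    for y_i, mu_i, sigma_i, theta_i in zip(y, mu, sigma, theta):
        a = (y_i - mu_i) ** 2 / (2.0 * sigma_i ** 2)
        b = (y_i - mu_bar) ** 2 / (2.0 * sigma_bar ** 2)
        ratio = (1.0 - theta_i) / theta_i * sigma_i / sigma_bar * math.exp(a - b)
        total += a + math.log(sigma_i / theta_i) - math.log1p(ratio)
    return total + len(y) * math.log(math.sqrt(2.0 * math.pi))


def direct_negative_log_likelihood(y, mu, sigma, theta, mu_bar, sigma_bar):
    """Same quantity computed from the mixture in equation (10), as a cross-check."""
    def pdf(v, m, s):
        return math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

    return -sum(
        math.log(t * pdf(v, m, s) + (1.0 - t) * pdf(v, mu_bar, sigma_bar))
        for v, m, s, t in zip(y, mu, sigma, theta)
    )


# Hypothetical observations and per-point model outputs (assumptions).
y = [101.0, 95.0, 130.0]
mu = [100.0, 97.0, 128.0]
sigma = [2.0, 3.0, 5.0]
theta = [0.9, 0.6, 0.8]
mu_bar, sigma_bar = 105.0, 12.0

nll = negative_log_likelihood(y, mu, sigma, theta, mu_bar, sigma_bar)
nll_direct = direct_negative_log_likelihood(y, mu, sigma, theta, mu_bar, sigma_bar)
```

The two values agree to within floating point rounding, which is consistent with equations (10) through (14) describing the same quantity.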

While it is noted that training the prediction models may include determining values of the one or more first variables and the one or more second variables to minimize the negative log likelihood function, those values may also be determined to ensure that p(z_(i)=1|w, x_(i)) (also referred to as θ_(i)) is towards 0 or 1 instead of ½. If θ_(i) approaches ½ instead of 0 or 1, there is a fuzziness in selecting either the first prediction model or the second prediction model for the prediction. In other words, a probability of ½ indicates that the system 100 is just as likely to pick one prediction model as the other.

In some implementations, another term is added to the joint loss function to prevent such fuzziness. The term causes the probability to shift towards 0 or 1. An example term may include (θ_(i)−½)² or |θ_(i)−½|. The term may be accompanied with a tunable parameter λ and combined with the combined loss function and the likelihood function. For the specific example of a joint loss function in equation (11) above, the term (with the tunable parameter λ) may be added to the joint loss function, such as indicated in equation (15) below:

$\begin{matrix}{{l_{i} + {\lambda*{\left| {\theta_{i} - \frac{1}{2}} \right|}}} = {{\frac{\theta_{i}}{\sigma_{i}\sqrt{2\pi}} \cdot {e^{{- \frac{1}{2}}{(\frac{y_{i} - \mu_{i}}{\sigma_{i}})}^{2}}\left( {1 + {\frac{\left( {1 - \theta_{i}} \right)\sigma_{i}}{\theta_{i}\overset{\_}{\sigma}} \cdot e^{{\frac{1}{2\sigma_{i}^{2}}{({y_{i} - \mu_{i}})}^{2}} - {\frac{1}{2{\overset{\_}{\sigma}}^{2}}{({y_{i} - \overset{\_}{\mu}})}^{2}}}}} \right)}} + {\lambda*{\left| {\theta_{i} - \frac{1}{2}} \right|}}}} & (15)\end{matrix}$

The total likelihood L based on l_(i) is the same as described above. In this manner, the log likelihood may be the same as in equation (13) above. A sum of the additional term across all i (such as λ*Σ_(i=1)^(N)|θ_(i)−½|) may be added to the log likelihood function (such as to equation (13)), which corresponds to subtracting the sum from the negative log likelihood function. In this manner, the function to be minimized (such as based on a negative log likelihood in equation (14) with the constant N log √(2π) removed) is provided in equation (16) below:

$\begin{matrix}{{{\min_{w}{\left\lbrack {{\sum_{i = 1}^{N}\left( {\frac{\left( {y_{i} - \mu_{i}} \right)^{2}}{2\sigma_{i}^{2}} + {\log\left( \frac{\sigma_{i}}{\theta_{i}} \right)} - {\log\left( {1 + {\frac{1 - \theta_{i}}{\theta_{i}} \cdot \frac{\sigma_{i}}{\overset{\_}{\sigma}} \cdot e^{{\frac{1}{2\sigma_{i}^{2}}{({y_{i} - \mu_{i}})}^{2}} - {\frac{1}{2{\overset{\_}{\sigma}}^{2}}{({y_{i} - \overset{\_}{\mu}})}^{2}}}}} \right)}} \right)} - {\lambda*{\sum_{i = 1}^{N}{\left| {\theta_{i} - \frac{1}{2}} \right|}}}} \right\rbrack{\forall\theta_{i}}}},\mspace{20mu}{0 \leq \theta_{i} \leq 1}}\mspace{70mu}} & (16)\end{matrix}$

Equation (16) is a joint optimization problem regarding the set of first variables and second variables and regarding the probabilities θ_(i). In this manner, optimizing the joint loss function by minimizing the function in equation (16) is in consideration of adjusting θ_(i) away from ½ to prevent fuzziness in determining with which prediction model a prediction is associated.
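The effect of the added term in equation (16) can be sketched by applying it to an already computed negative log likelihood value. The function name `regularized_objective`, the parameter name `lam`, and the numeric values are assumptions for illustration; the absolute value form of the term is used, and the negative log likelihood is passed in as a plain number to keep the sketch short.

```python
def regularized_objective(nll, thetas, lam):
    """Objective of equation (16): negative log likelihood minus
    lam * sum(|theta_i - 1/2|).

    Subtracting the term rewards theta_i values far from 1/2 when the
    objective is minimized.
    """
    for theta in thetas:
        if not 0.0 <= theta <= 1.0:
            raise ValueError("each theta must lie in [0, 1]")
    penalty = sum(abs(theta - 0.5) for theta in thetas)
    return nll - lam * penalty


# Same hypothetical negative log likelihood, two different sets of thetas.
fuzzy = regularized_objective(10.0, [0.5, 0.5], lam=2.0)
decisive = regularized_objective(10.0, [0.9, 0.1], lam=2.0)
```

For an equal negative log likelihood, the decisive assignment yields the lower objective, so a minimizer is nudged away from θ_(i) near ½. How large λ should be is a tuning choice that the sketch does not address.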

As described above, a system is configured to filter which predictions are to be indicated to a user and to indicate such predictions to the user. The predictions that are indicated to a user are determined by a first prediction model and then compared to a second prediction model to determine a variance of the prediction from the second prediction model. Operations in indicating a prediction to a user and preventing an indication of a prediction to the user based on a joint loss function, training the models based on optimizing the joint loss function, and other suitable operations are described in the above examples for explaining aspects of the present disclosure.

FIG. 4 shows an illustrative flow chart depicting an example operation 400 for dynamically selecting a forecasting model, according to some other implementations. The example operation 400 may be performed by one or more processors of a computing device associated with the forecasting system. In some implementations, the example operation 400 may be performed using the ML augmented forecasting system 100 of FIG. 1. It is to be understood that the example operation 400 may be performed by any suitable systems, computers, or servers.

At block 402, the ML augmented forecasting system 100 retrieves a number of data points. At block 404, the ML augmented forecasting system 100 generates, using a machine learning model, confidence values indicating whether a first forecasting model or a second forecasting model is more likely to generate an accurate prediction for each of the number of data points. At block 406, the ML augmented forecasting system 100 selects the first forecasting model or the second forecasting model for each of the number of data points based on the respective confidence values. At block 408, the ML augmented forecasting system 100 generates, for each of the number of data points, prediction data using the selected one of the first forecasting model or the second forecasting model. At block 410, the ML augmented forecasting system 100 trains the machine learning model to generate more accurate confidence values for data points based on the prediction data.
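Blocks 402 through 408 can be sketched as a per-data-point loop. Everything named below is a hypothetical stand-in: the function `run_operation_400`, the callables passed to it, the 0.5 selection threshold, and the toy models. The training at block 410 is omitted from this sketch.

```python
def run_operation_400(data_points, confidence_fn, first_model, second_model,
                      threshold=0.5):
    """Select a forecasting model per data point and generate prediction data.

    confidence_fn stands in for the machine learning model of block 404;
    threshold is an illustrative cut-off for the selection at block 406.
    """
    results = []
    for x in data_points:                      # block 402: retrieved data points
        confidence = confidence_fn(x)          # block 404: confidence value
        use_first = confidence > threshold     # block 406: model selection
        model = first_model if use_first else second_model
        results.append({
            "input": x,
            "confidence": confidence,
            "model": "first" if use_first else "second",
            "prediction": model(x),            # block 408: prediction data
        })
    return results


# Toy stand-ins (assumptions) used only to exercise the flow.
def toy_confidence(x):
    return 0.9 if x > 10 else 0.2


def toy_first_model(x):
    return 2.0 * x


def toy_second_model(x):
    return 7.0


out = run_operation_400([4.0, 12.0], toy_confidence,
                        toy_first_model, toy_second_model)
```

In this toy run the first data point is routed to the second model and the second data point to the first model.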

In some implementations, the machine learning model may be trained to automatically generate confidence values that indicate whether a prediction is from a first forecasting model or a second forecasting model. In this manner, the machine learning model is trained to optimize, during the training process, between the first forecasting model and the second forecasting model.

In some aspects, each of the number of data points is associated with a corresponding customer. In some implementations, the ML augmented forecasting system may output, for each data point, at least one of the selected forecasting model or the prediction data to the corresponding customer. In some other implementations, generating the confidence values is based on a predicted mean associated with the first forecasting model, a predicted standard deviation associated with the first forecasting model, a fixed mean associated with the second forecasting model, and a fixed standard deviation associated with the second forecasting model. In some aspects, a confidence value of 1 indicates that the first forecasting model is more likely to generate an accurate prediction for a respective data point than the second forecasting model, and a confidence value of 0 indicates that the first forecasting model is less likely to generate an accurate prediction for the respective data point than the second forecasting model.

In some implementations, training the machine learning model includes generating a joint uncertainty function associated with the first and second forecasting models. In some instances, the ML augmented forecasting system may identify a first uncertainty function associated with the first forecasting model, the first uncertainty function indicating a degree of error of the first forecasting model for a given data point, and identify a second uncertainty function associated with the second forecasting model, the second uncertainty function indicating a degree of error of the second forecasting model for the given data point, where generating the joint uncertainty function is based on the first uncertainty function and the second uncertainty function. In some other instances, the ML augmented forecasting system may update one or more previous joint uncertainty functions with the generated joint uncertainty function.

In some other implementations, the ML augmented forecasting system may generate a total likelihood value jointly associated with the first and second forecasting model based on the joint uncertainty function. In some instances, the ML augmented forecasting system may generate a loglikelihood value jointly associated with the first and second forecasting model and a negative loglikelihood value jointly associated with the first and second forecasting model based on the total likelihood value.
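
The relationship between the three values can be sketched in Python as follows. The sketch assumes that the joint uncertainty function yields one positive value per data point and that the data points are treated as independent, so that the total likelihood is the product of those values. These assumptions and the function name are illustrative only.

```python
import math
from typing import Sequence, Tuple


def likelihood_values(joint_values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (total likelihood, loglikelihood, negative loglikelihood).

    joint_values holds one positive joint-function value per data point
    (an assumed input format).
    """
    if not joint_values:
        raise ValueError("at least one joint value is required")
    if any(v <= 0.0 for v in joint_values):
        raise ValueError("joint values must be positive")
    # Total likelihood: product over data points (independence assumed).
    total_likelihood = math.prod(joint_values)
    # Summing logs equals log(total_likelihood) mathematically, but avoids
    # underflow when many small values are multiplied.
    log_likelihood = sum(math.log(v) for v in joint_values)
    return total_likelihood, log_likelihood, -log_likelihood
```

A training procedure could then minimize the negative loglikelihood, which is equivalent to maximizing the total likelihood.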

As used herein, a phrase referring to “at least one of” a list of items refers to any combination of those items, including single members. As an example, “at least one of: a, b, or c” is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.

Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present application, discussions utilizing the terms such as “accessing,” “receiving,” “sending,” “using,” “selecting,” “determining,” “normalizing,” “multiplying,” “averaging,” “monitoring,” “comparing,” “applying,” “updating,” “measuring,” “deriving” or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The various illustrative logics, logical blocks, modules, circuits, and algorithm processes described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. The interchangeability of hardware and software has been described generally, in terms of functionality, and illustrated in the various illustrative components, blocks, modules, circuits and processes described above. Whether such functionality is implemented in hardware or software depends upon the particular application and design constraints imposed on the overall system.

The hardware and data processing apparatus used to implement the various illustrative logics, logical blocks, modules and circuits described in connection with the aspects disclosed herein may be implemented or performed with a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor or any conventional processor, controller, microcontroller, or state machine. A processor also may be implemented as a combination of computing devices such as, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In some implementations, particular processes and methods may be performed by circuitry that is specific to a given function.

In one or more aspects, the functions described may be implemented in hardware, digital electronic circuitry, computer software, firmware, including the structures disclosed in this specification and their structural equivalents, or in any combination thereof. Implementations of the subject matter described in this specification also can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by, or to control the operation of, data processing apparatus.

If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. The processes of a method or algorithm disclosed herein may be implemented in a processor-executable software module which may reside on a computer-readable medium. Computer-readable media include both computer storage media and communication media including any medium that can be enabled to transfer a computer program from one place to another. A storage medium may be any available medium that may be accessed by a computer. By way of example, and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to store desired program code in the form of instructions or data structures and that may be accessed by a computer. Also, any connection can be properly termed a computer-readable medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. Additionally, the operations of a method or algorithm may reside as one or any combination or set of codes and instructions on a machine readable medium and computer-readable medium, which may be incorporated into a computer program product.

Various modifications to the implementations described in this disclosure may be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.

What is claimed is:
 1. A method for dynamically selecting a forecasting model, the method performed by one or more processors of a forecasting system and comprising: retrieving a number of data points; generating, using a machine learning model, confidence values indicating whether a first forecasting model or a second forecasting model is more likely to generate an accurate prediction for each of the number of data points; selecting the first forecasting model or the second forecasting model for each of the number of data points based on the respective confidence values; generating, for each of the number of data points, prediction data using the selected one of the first forecasting model or the second forecasting model; and training the machine learning model to generate more accurate confidence values for data points based on the prediction data.
 2. The method of claim 1, wherein each of the number of data points is associated with a corresponding customer.
 3. The method of claim 2, further comprising: outputting, for each data point, at least one of the selected forecasting model or the prediction data to the corresponding customer.
 4. The method of claim 1, wherein generating the confidence values is based on a predicted mean associated with the first forecasting model, a predicted standard deviation associated with the first forecasting model, a fixed mean associated with the second forecasting model, and a fixed standard deviation associated with the second forecasting model.
 5. The method of claim 1, wherein: a confidence value of 1 indicates that the first forecasting model is more likely to generate an accurate prediction for a respective data point than the second forecasting model; and a confidence value of 0 indicates that the first forecasting model is less likely to generate an accurate prediction for the respective data point than the second forecasting model.
 6. The method of claim 1, wherein training the machine learning model includes: generating a joint uncertainty function associated with the first and second forecasting model.
 7. The method of claim 6, further comprising: identifying a first uncertainty function associated with the first forecasting model, the first uncertainty function indicating a degree of error of the first forecasting model for a given data point; and identifying a second uncertainty function associated with the second forecasting model, the second uncertainty function indicating a degree of error of the second forecasting model for the given data point, wherein generating the joint uncertainty function is based on the first uncertainty function and the second uncertainty function.
 8. The method of claim 6, further comprising: updating one or more previous joint uncertainty functions with the generated joint uncertainty function.
 9. The method of claim 6, wherein training the machine learning model further includes: generating a total likelihood value jointly associated with the first and second forecasting model based on the joint uncertainty function.
 10. The method of claim 9, wherein training the machine learning model further includes: generating a loglikelihood value jointly associated with the first and second forecasting model and a negative loglikelihood value jointly associated with the first and second forecasting model based on the total likelihood value.
 11. A system for dynamically selecting a forecasting model, the system comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, causes the system to: retrieve a number of data points; generate, using a machine learning model, confidence values indicating whether a first forecasting model or a second forecasting model is more likely to generate an accurate prediction for each of the number of data points; select the first forecasting model or the second forecasting model for each of the number of data points based on the respective confidence values; generate, for each of the number of data points, prediction data using the selected one of the first forecasting model or the second forecasting model; and train the machine learning model to generate more accurate confidence values for data points based on the prediction data.
 12. The system of claim 11, wherein each of the number of data points is associated with a corresponding customer.
 13. The system of claim 12, wherein execution of the instructions further causes the system to: output, for each data point, at least one of the selected forecasting model or the prediction data to the corresponding customer.
 14. The system of claim 11, wherein generating the confidence values is based on a predicted mean associated with the first forecasting model, a predicted standard deviation associated with the first forecasting model, a fixed mean associated with the second forecasting model, and a fixed standard deviation associated with the second forecasting model.
 15. The system of claim 11, wherein: a confidence value of 1 indicates that the first forecasting model is more likely to generate an accurate prediction for a respective data point than the second forecasting model; and a confidence value of 0 indicates that the first forecasting model is less likely to generate an accurate prediction for the respective data point than the second forecasting model.
 16. The system of claim 11, wherein training the machine learning model includes: generating a joint uncertainty function associated with the first and second forecasting model.
 17. The system of claim 16, wherein execution of the instructions further causes the system to: identify a first uncertainty function associated with the first forecasting model, the first uncertainty function indicating a degree of error of the first forecasting model for a given data point; and identify a second uncertainty function associated with the second forecasting model, the second uncertainty function indicating a degree of error of the second forecasting model for the given data point, wherein generating the joint uncertainty function is based on the first uncertainty function and the second uncertainty function.
 18. The system of claim 16, wherein execution of the instructions further causes the system to: update one or more previous joint uncertainty functions with the generated joint uncertainty function.
 19. The system of claim 16, wherein training the machine learning model further includes: generating a total likelihood value jointly associated with the first and second forecasting model based on the joint uncertainty function.
 20. The system of claim 19, wherein training the machine learning model further includes: generating a loglikelihood value jointly associated with the first and second forecasting model and a negative loglikelihood value jointly associated with the first and second forecasting model based on the total likelihood value.